Native Language Identification using large scale lexical features
Author
Abstract
This paper describes an effort to perform Native Language Identification (NLI) using machine learning on a large number of lexical features. The features were collected from sequences and collocations of bare word forms, suffixes, and character n-grams, amounting to a feature set of several hundred thousand features. These features were used to train a linear Support Vector Machine (SVM) classifier to predict the native language category.
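The kind of feature extraction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `lexical_features`, the feature prefixes, and the parameter values (word n-grams up to length 2, character 3-grams, 3-letter suffixes) are assumptions chosen for illustration, since the abstract does not state them. Collocations are not covered.

```python
from collections import Counter


def lexical_features(text, max_word_n=2, char_n=3, suffix_len=3):
    """Count word n-grams, word suffixes and character n-grams in one text.

    All parameter defaults are illustrative assumptions, not values
    taken from the paper.
    """
    tokens = text.lower().split()
    feats = Counter()

    # Sequences of bare word forms: unigrams up to max_word_n-grams.
    for n in range(1, max_word_n + 1):
        for i in range(len(tokens) - n + 1):
            feats["w:" + " ".join(tokens[i:i + n])] += 1

    # Suffixes: the last suffix_len characters of each longer word.
    for tok in tokens:
        if len(tok) > suffix_len:
            feats["s:" + tok[-suffix_len:]] += 1

    # Character n-grams over the whitespace-normalised, lower-cased text.
    joined = " ".join(tokens)
    for i in range(len(joined) - char_n + 1):
        feats["c:" + joined[i:i + char_n]] += 1

    return feats


if __name__ == "__main__":
    for name, count in sorted(lexical_features("the cats walked").items()):
        print(name, count)
```

Applied to a whole corpus, feature dictionaries like this one would typically be mapped to a sparse matrix and passed to a linear SVM (for example scikit-learn's `LinearSVC`); that step is omitted here to keep the sketch dependency-free.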
Similar resources
Exploring Syntactic Representations for Native Language Identification
Tree Substitution Grammar rules form a large and expressive class of features capable of representing syntactic and lexical patterns that provide evidence of an author’s native language. However, this class of features can be applied to any general constituent based model of grammar and previous work has done little to explore these options, relying primarily on the common Penn Treebank annotat...
The Use of Lexical Bundles in Native and Non-native Post-graduate Writing: The Case of Applied Linguistics MA Theses
Connor et al. (2008) mention “specifying textual requirements of genres” (p.12) as one of the reasons which have motivated researchers in the analysis of writing. Members of each genre should be able to produce and retrieve these textual requirements appropriately to be considered communicatively proficient. One of the textual requirements of genres is regularities of specific forms and content...
Feature Space Selection and Combination for Native Language Identification
We describe the submissions made by the National Research Council Canada to the Native Language Identification (NLI) shared task. Our submissions rely on a Support Vector Machine classifier, various feature spaces using a variety of lexical, spelling, and syntactic features, and on a simple model combination strategy relying on a majority vote between classifiers. Somewhat surprisingly, a classi...
Native Language Identification: a Simple n-gram Based Approach
This paper describes our approaches to Native Language Identification (NLI) for the NLI shared task 2013. NLI as a sub area of author profiling focuses on identifying the first language of an author given a text in his second language. Researchers have reported several sets of features that have achieved relatively good performance in this task. The type of features used in such works are: lexi...
Topic Modeling for Native Language Identification
Native language identification (NLI) is the task of determining the native language of an author writing in a second language. Several pieces of earlier work have found that features such as function words, part-of-speech n-grams and syntactic structure are helpful in NLI, perhaps representing characteristic errors of different native language speakers. This paper looks at the idea of using Lat...
Publication date: 2013